In [1]:
import numpy as np
In [2]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
In [3]:
# Initializing the token index as an empty dictionary
token_index = {}
In [4]:
# Testing the result of the .split() method
samples[0].split()
Out[4]:
In [5]:
for sample in samples:
    # Each sample is split into words.
    # Punctuation is currently not stripped (see the clean-up sketch after this cell).
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word.
            # Index 0 is not assigned to anything.
            token_index[word] = len(token_index) + 1
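As noted in the comment above, punctuation is not stripped, so 'mat.' and 'homework.' keep their trailing periods and would be treated as different tokens from 'mat' and 'homework'. A minimal clean-up sketch, assuming only the standard string module (this helper is not used in the cells that follow):
In [ ]:
import string

def clean_split(sample):
    # Strip leading/trailing punctuation and lowercase each token.
    return [word.strip(string.punctuation).lower() for word in sample.split()]

clean_split(samples[0])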
In [6]:
token_index
Out[6]:
In [7]:
# Taking into consideration only the first 10 words of each sample.
max_length = 10
In [8]:
# Initializing the result array with zeros
# It will be of shape (number_of_samples, max_length, max(token_index.values()) + 1); index 0 is never used.
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
In [9]:
# Enumerating through samples and words
# One-hot encoding
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
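As a quick sanity check, the one-hot vectors can be decoded back into words by inverting token_index; this should reproduce the original words (with their trailing punctuation). A minimal sketch, assuming the results array built above:
In [ ]:
# Invert the word index so one-hot positions map back to words.
reverse_index = {index: word for word, index in token_index.items()}

# np.argmax locates the 1 in each one-hot vector; all-zero rows
# (unused timesteps) are skipped via vector.any().
[[reverse_index[int(np.argmax(vector))] for vector in sample if vector.any()]
 for sample in results]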
In [10]:
results
Out[10]:
In [11]:
import string
In [12]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
In [13]:
# Assigning all printable ASCII characters
characters = string.printable
In [14]:
characters
Out[14]:
In [15]:
# Tokenizing the characters
token_index = dict(zip(characters, range(1, len(characters) + 1)))
In [16]:
token_index
Out[16]:
In [17]:
# Taking into consideration only the first 50 characters of each sample
max_length = 50
In [18]:
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
In [19]:
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
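Since string.printable contains 100 characters and index 0 is left unused, the character-level tensor should have shape (2, 50, 101); a quick check:
In [ ]:
# 100 printable characters plus the unused index 0 give a last axis of size 101.
len(characters), results.shape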
In [20]:
results
Out[20]:
In [21]:
# Importing Keras Tokenizer
from keras.preprocessing.text import Tokenizer
In [22]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
In [23]:
# Initializing the tokenizer, which will take into account only the 1,000 most common words.
tokenizer = Tokenizer(num_words=1000)
In [24]:
# Building the dictionary
tokenizer.fit_on_texts(samples)
In [25]:
# Turning the texts into sequences of integers corresponding to the unique words
sequences = tokenizer.texts_to_sequences(samples)
In [26]:
sequences
Out[26]:
In [27]:
# Representing the data as one-hot encoded vectors.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
In [28]:
one_hot_results
Out[28]:
In [29]:
# Retrieving the index
word_index = tokenizer.word_index
In [30]:
word_index
Out[30]:
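For comparison with the manual approach above, the Keras word index can also be inverted to map the integer sequences back to tokens; note that the Tokenizer has already lowercased the text and stripped punctuation. A minimal sketch:
In [ ]:
# Invert Keras' word -> index mapping to decode the sequences.
reverse_word_index = {index: word for word, index in word_index.items()}
[[reverse_word_index[index] for index in sequence] for sequence in sequences]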
"A variant of one-hot encoding is the so-called "one-hot hashing trick", which can be used when the number of unique tokens in your vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary, one may hash words into vectors of fixed size. This is typically done with a very lightweight hashing function. The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (starting to generate token vectors right away, before having seen all of the available data). The one drawback of this method is that it is susceptible to "hash collisions": two different words may end up with the same hash, and subsequently any machine learning model looking at these hashes won't be able to tell the difference between these words. The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed."
In [31]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
In [32]:
# Storing words as vectors of size 1,000.
# Hash collisions are possible; when they occur, the accuracy of the encoding drops.
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
In [33]:
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a "random" integer index
        # between 0 and dimensionality - 1.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
In [34]:
results
Out[34]:
In [35]:
# 2 samples, each truncated to at most 10 words, each word one-hot hashed into a 1,000-dimensional vector
results.shape
Out[35]:
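One caveat worth noting: Python's built-in hash() for strings is randomized per interpreter process (unless PYTHONHASHSEED is fixed), so the bucket a given word lands in can change between runs. A deterministic alternative, sketched here with the standard hashlib module (the helper name stable_hash_index is made up for illustration):
In [ ]:
import hashlib

def stable_hash_index(word, dimensionality=1000):
    # md5 produces the same digest for the same word in every process,
    # so the hashed index is reproducible across runs.
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

stable_hash_index('cat')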